Unable to convert text format to proper data frame using Pandas - python-3.x

I am reading a text source from URL = 'https://www.census.gov/construction/bps/txt/tb2u201901.txt'
and using Pandas to convert it into a DataFrame:
df = pd.read_csv(URL, sep='\t')
After exporting the df I see that all the columns are merged into a single column despite specifying the
separator as '\t'. How do I solve this issue?

As your file is not a CSV file, you should use the pandas function read_fwf(), because your columns have a fixed width. You also need to skip the first 12 lines, which are not part of your data, and remove the empty lines with dropna():
df = pd.read_fwf(URL, skiprows=12)
df.dropna(inplace=True)
df.head()
United States 94439 58086 1600 1457 33296 1263
1 Northeast 9099.0 3330.0 272.0 242.0 5255.0 242.0
2 New England 1932.0 1079.0 90.0 72.0 691.0 46.0
3 Connecticut 278.0 202.0 8.0 3.0 65.0 8.0
4 Maine 357.0 222.0 6.0 0.0 129.0 5.0
5 Massachusetts 819.0 429.0 38.0 54.0 298.0 23.0
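Note that with skiprows=12 the "United States" line is consumed as the header row, which is why it appears where the column names would normally be in the output above. If you would rather keep that line as data and give the columns readable names, something along these lines should work (the names here are placeholders I made up, not the real headings from the census file, and you may need to adjust their number to match the columns read_fwf infers):

import pandas as pd

URL = 'https://www.census.gov/construction/bps/txt/tb2u201901.txt'
cols = ['area', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6']  # placeholder names
df = pd.read_fwf(URL, skiprows=12, header=None, names=cols)
df = df.dropna()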

Your output is actually correct. If you open the URL, you will see that the file starts with sentences that are not tab separated, so pandas cannot present them correctly.
From line number 9 onwards the results are correct.
(screenshot: https://i.stack.imgur.com/2K61J.png)


Fill missing value in different columns of dataframe using mean or median of last n values

I have a dataframe which contains time series data. What I want to do is efficiently fill all the missing values in different columns by substituting the median value within a timedelta of, say, N minutes. E.g. if for a column I have data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the data at 10:22 is missing, then with a timedelta of 2 minutes I would want it to be filled with the median value of 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:

    for each column in the dataframe:
        find the indices which have a NaN value
        for each index which has a NaN value:
            extract all values using between_time with index - timedelta and index + timedelta
            find the median of the extracted values
            set the value at that index to the extracted median

This looks like two nested for loops and not very efficient. Is there an efficient way to do it? (A literal code sketch of this loop idea is shown below.)
Thanks
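For reference, here is a literal (and deliberately unoptimized) sketch of the loop-based idea above. It assumes the frame has a sorted DatetimeIndex; the function name and the 2-minute delta are illustrative only, not taken from the question or the answer that follows.

import pandas as pd

def fill_with_window_median(df, delta="2min"):
    # For every NaN, substitute the median of the non-missing values
    # observed within +/- delta of that timestamp (two explicit loops).
    delta = pd.Timedelta(delta)
    out = df.copy()
    for col in df.columns:                        # loop over columns
        for idx in df.index[df[col].isna()]:      # loop over missing timestamps
            window = df.loc[idx - delta:idx + delta, col].dropna()
            if not window.empty:
                out.loc[idx, col] = window.median()
    return out

The answer below avoids these explicit loops by resampling to a regular frequency and using a centered rolling window.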
IIUC you can resample your time column, then fillna with a centered rolling window:
# dummy data setup
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
time value
0 10:00:00 4
1 10:01:00 9
2 10:02:00 3
3 10:03:00 3
4 10:04:00 8
5 10:06:00 9
6 10:07:00 2
7 10:08:00 9
8 10:09:00 9
9 10:11:00 7
10 10:12:00 3
11 10:13:00 3
12 10:14:00 7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
value
time
10:00:00 4.0
10:01:00 9.0
10:02:00 3.0
10:03:00 3.0
10:04:00 8.0
10:05:00 5.5
10:06:00 9.0
10:07:00 2.0
10:08:00 9.0
10:09:00 9.0
10:10:00 7.0
10:11:00 7.0
10:12:00 3.0
10:13:00 3.0
10:14:00 7.0
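If you specifically want the median (as asked in the question) rather than the mean, the same pattern works unchanged with .median():

print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))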

Creating a single file from multiple files (Python 3.x)

I can't figure out a great way to do this, but I have 2 files with a standard date and value format.
File 1            File 2
Date   Value      Date   Value
4      7.0        1      9.0
5      5.5        .      .
6      4.0        7      2.0
I want to combine files 1 and 2 to get the following:
Combined Files
Date   Value1   Value2   Avg
1      NaN      9.0      9.0
2      NaN      9.0      9.0
3      NaN      8.5      8.5
4      7.0      7.5      7.25
5      5.5      5.0      5.25
6      4.0      3.5      3.75
7      NaN      2.0      2.0
How would I attempt this? I figured I should make a masked array with the date going from 1 to 7 and then just append the files together, but I don't know how I would do that with file 1. Any pointers on where to look would be appreciated.
Using Python 3.x
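For the small example above, a minimal pandas sketch of the combination step could look like this (the numbers are taken from the tables in the question; the merge-and-average approach mirrors the answer further down rather than the masked-array idea):

import pandas as pd

df1 = pd.DataFrame({'Date': [4, 5, 6], 'Value1': [7.0, 5.5, 4.0]})
df2 = pd.DataFrame({'Date': [1, 2, 3, 4, 5, 6, 7],
                    'Value2': [9.0, 9.0, 8.5, 7.5, 5.0, 3.5, 2.0]})

combined = df1.merge(df2, on='Date', how='outer').sort_values('Date')
combined['Avg'] = combined[['Value1', 'Value2']].mean(axis=1)   # NaNs are skipped
print(combined)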
EDIT:
I solved my own problem!
I am sure there is a better way to streamline this. My solution doesn't use the example above; I just threw in my code.
from glob import glob
from datetime import datetime
import numpy as np
import matplotlib.dates as mdates

def extractFiles(Dir, newDir, newDir2):
    fnames = glob(Dir)
    farray = np.array(fnames)
    ## Dates range from 723911 to 737030
    dateArray = np.arange(723911, 737030)  # Store the dates
    dataArray = []  # Store the data. This needs to be a list, not an np.array!
    for f in farray:
        ## Extract the data
        CH4 = np.genfromtxt(f, comments='#', delimiter=None, dtype=float).T
        myData = np.full(dateArray.shape, np.nan)  # Create a NaN-filled array
        myDate = np.array([])
        ## Convert the given datetime into something more usable
        for x, y in zip(*CH4[1:2], *CH4[2:3]):
            myDate = np.append(myDate,
                               mdates.date2num(datetime.strptime('{}-{}'.format(int(x), int(y)), '%Y-%m')))
        ## Find where the dates are the same and place the appropriate concentration value
        for i in range(len(CH4[3])):
            idx = np.where(dateArray == myDate[i])
            myData[idx] = CH4[3, i]
        ## Store all values in the list
        dataArray.append(myData)
    ## Convert the list to a numpy array and save it in a txt file
    dataArray = np.vstack((dateArray, dataArray))
    np.savetxt(newDir, dataArray.T, fmt='%1.2f', delimiter=',')
    ## Find the average of the data to plot
    avg = np.nanmean(dataArray[1:].T, 1)
    avg = np.vstack((dateArray, avg))
    np.savetxt(newDir2, avg.T, fmt='%1.2f', delimiter=',')
    return avg
Here is my answer based on the information you gave me:
import pandas as pd
import os

# I stored two Excel files in a subfolder of this sample code:
# Code
# ---- Files
# -------- File1.xlsx
# -------- File2.xlsx

# Here I am saving the path to a variable
file_path = os.path.join(*[os.getcwd(), 'Files', ''])

# I define an empty DataFrame that we then fill with the files' information
final_df = pd.DataFrame()

# file_number will be used to increment the Value column based on the number of files that we load.
# The first file will lead to Value1, the second to Value2
file_number = 1

# os.listdir is now "having a look" into the "Files" folder and will return a list of the files
# contained in there, ['File1.xlsx', 'File2.xlsx'] in our case
for file in os.listdir(file_path):
    # we load the Excel file with the pandas function "read_excel"
    df = pd.read_excel(file_path + file)
    # Rename the column "Value" to "Value" + the "file_number"
    df = df.rename(columns={'Value': 'Value' + str(file_number)})
    # Check if the DataFrame already contains values
    if not final_df.empty:
        # If there are values already, we merge them together with the new values
        final_df = final_df.merge(df, how='outer', on='Date')
    else:
        # Otherwise we "initialize" our final_df with the first Excel file that we loaded
        final_df = df
    # at the end we increment the file number by one to continue with the next file
    file_number += 1

# get all column names that have "Value" in them
value_columns = [w for w in final_df.columns if 'Value' in w]
# Create a new column for the average, built over all value columns that we found
final_df['Avg'] = final_df.apply(lambda x: x[value_columns].mean(), axis=1)
# Sort the dataframe based on the Date
sorted_df = final_df.sort_values('Date')
print(sorted_df)
The print will output this:
Date Value1 Value2 Avg
3 1 NaN 9.0 9.00
4 2 NaN 9.0 9.00
5 3 NaN 8.5 8.50
0 4 7.0 7.5 7.25
1 5 5.5 5.0 5.25
2 6 4.0 3.5 3.75
6 7 NaN 2.0 2.00
Please be aware that this does not pay attention to the file names; it simply loads one file after another in alphabetical order.
But this has the advantage that you can put as many files in there as you want.
If you need to load them in a specific order, I can probably help you with that as well (see the small sketch below).
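For example, a minimal way to pin down the order (assuming the file names themselves sort in the order you want) is to sort the directory listing before looping:

for file in sorted(os.listdir(file_path)):
    # ... same loop body as above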

How to combine several csv in one with identical rows?

I have several csv files with approximately the following structure:
name,title,status,1,2,3
name,title,status,4,5,6
name,title,status,7,8,9
The name column is mostly the same in all files; only the columns 1,2,3,4,... differ.
I need to successively add new columns to existing and new rows, as well as update the remaining rows each time.
For example, I have 2 tables:
name,title,status,1,2,3
Foo,Bla-bla-bla,10,45.6,12.3,45.2
Bar,Too-too,13,13.4,22.6,75.1
name,title,status,4,5,6
Foo,Bla-bla-bla,14,25.3,125.3,5.2
Fobo,Dom-dom,20,53.4,2.9,11.3
And as output I expect this table:
name,title,status,1,2,3,4,5,6
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
I did not find anything similar; can anyone tell me how I can do this?
It looks like you want to keep just one version of ['name', 'title', 'status'] and from your example, you prefer to keep the last 'status' encountered.
I'd use pd.concat and follow that up with a groupby to filter out duplicate status.
import pandas as pd

df = pd.concat([
    pd.read_csv(fp, index_col=['name', 'title', 'status'])
    for fp in ['data1.csv', 'data2.csv']
], axis=1).reset_index('status').groupby(level=['name', 'title']).last()

df
                   status     1     2     3     4      5     6
name  title
Bar   Too-too          13  13.4  22.6  75.1   NaN    NaN   NaN
Fobo  Dom-dom          20   NaN   NaN   NaN  53.4    2.9  11.3
Foo   Bla-bla-bla      14  45.6  12.3  45.2  25.3  125.3   5.2
Then df.to_csv() produces
name,title,status,1,2,3,4,5,6
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Keep merging them:
import pandas as pd

df = None
for path in ['data1.csv', 'data2.csv']:
    sub_df = pd.read_csv(path)
    if df is None:
        df = sub_df
    else:
        df = df.merge(sub_df, on=['name', 'title', 'status'], how='outer')
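Note that merging on 'status' keeps rows whose status differs between files (like Foo) as separate rows, rather than collapsing them to the latest status the way the expected output and the first answer do. A rough sketch of one way to get that behaviour with the merge approach (same data1.csv/data2.csv as above; this variant is my assumption, not part of the original answer):

import pandas as pd

df = None
for path in ['data1.csv', 'data2.csv']:
    sub_df = pd.read_csv(path)
    if df is None:
        df = sub_df
    else:
        # merge on name/title only, then prefer the newer status where present
        df = df.merge(sub_df, on=['name', 'title'], how='outer', suffixes=('', '_new'))
        df['status'] = df['status_new'].combine_first(df['status'])
        df = df.drop(columns='status_new')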

How do I fill these `NaN` values properly?

Here's my original dataframe with NaN values which I'm trying to fill:
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill up the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, the NaN value at column 1981, row 3 has been filled properly with a value other than NaN. I want to fill the rest of the NaNs like that as well. Any idea how I do that?
Using DataFrame.interpolate()
In your case it is failing because there are no columns to the left of the first column, so the interpolate method doesn't know what to interpolate towards: missing_value = (left_value + right_value) / 2
So you could, for example, insert a column of all 0's on the left (if you would like to impute the missing values in the first column with half of the next value), as such:
df.insert(loc=0, column='allZeroes', value=0)
After this, you can interpolate as you are doing and then remove the helper column.
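A minimal sketch of that workaround (the toy frame here is made up; only the allZeroes trick comes from this answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({1981: [np.nan, 4.0], 1991: [6.0, 8.0]})   # toy data
df.insert(loc=0, column='allZeroes', value=0)                # temporary left-most column
df = df.interpolate(axis=1)                                  # row-wise, as in the question
df = df.drop(columns='allZeroes')                            # drop the helper again
print(df)   # the NaN in row 0 becomes (0 + 6.0) / 2 = 3.0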
General missing value imputation
Either use df.fillna('DEFAULT-VALUE'), as Alex mentioned in the comments to the question (see the fillna docs),
or do something like:
df.my_col[df.my_col.isnull()] = 'DEFAULT-VALUE'
I'd recommend using fillna, as you can then use methods such as forward fill (ffill), which imputes the missing values with the previous value, and other similar methods.
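For example (a tiny illustration, assuming df is your frame; not part of the original answer):

df = df.ffill()   # impute each missing value with the previous value
df = df.bfill()   # then fill any NaNs left at the very start from the next value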
It seems like you might want to interpolate on axis=0, column-wise:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5, 7),
...                   columns=[1951, 1961, 1971, 1981, 1991, 2001, 2001],
...                   index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
1951 1961 1971 1981 1991 2001 2001
0 0.0 1.0 2.0 3.0 4.0 5.0 6.0
1 7.0 8.0 9.0 10.0 11.0 12.0 13.0
2 14.0 15.0 16.0 17.0 18.0 19.0 20.0
3 21.0 22.0 23.0 24.0 25.0 26.0 27.0
4 28.0 29.0 30.0 31.0 32.0 33.0 34.0
Currently you're interpolating row-wise. NaNs that "begin" a Series aren't padded by a value on either side, making interpolation impossible for them.
Update: pandas is adding some more optionality for this in v 0.23.0.
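For instance, starting with pandas 0.23 the limit_area argument lets you restrict which NaNs get filled, e.g. only those bounded by valid values on both sides (a small illustration, not from the original answer):

df.interpolate(axis=0, limit_area='inside')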

Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns

The charts below show my basic challenge: subtract NUMBER OF STOCKS WITH DATA END from NUMBER OF STOCKS WITH DATA START. The problem I am having is that the date ranges of the two series do not match, so I need to merge both sets onto a common date range, perform the subtraction, and save the result to a new comma-separated-value file.
The input data, in a file named 'meta.csv', contains 3187 lines. The fields per line are ticker, start, and end. Head and tail are shown here:
0000 ticker,start,end
0001 A,1999-11-18,2016-12-27
0002 AA,2016-11-01,2016-12-27
0003 AAL,2005-09-27,2016-12-27
0004 AAMC,2012-12-13,2016-12-27
0005 AAN,1984-09-07,2016-12-27
...
3183 ZNGA,2011-12-16,2016-12-27
3184 ZOES,2014-04-11,2016-12-27
3185 ZQK,1990-03-26,2015-09-09
3186 ZTS,2013-02-01,2016-12-27
3187 ZUMZ,2005-05-06,2016-12-27
Python code and console output:
import pandas as pd
df = pd.read_csv('meta.csv')
s = df.groupby('start').size().cumsum()
e = df.groupby('end').size().cumsum()
#s.plot(title='NUMBER OF STOCKS WITH DATA START',
# grid=True,style='k.')
#e.plot(title='NUMBER OF STOCKS WITH DATA END',
# grid=True,style='k.')
print(s.head(5))
print(s.tail(5))
print(e.tail(5))
OUT:
start
1962-01-02 11
1962-11-19 12
1970-01-02 30
1971-08-06 31
1972-06-01 54
dtype: int64
start
2016-07-05 3182
2016-10-04 3183
2016-11-01 3184
2016-12-05 3185
2016-12-08 3186
end
2016-12-08 544
2016-12-15 545
2016-12-16 546
2016-12-21 547
2016-12-27 3186
dtype: int64
Chart output when the plotting lines in the code above are uncommented:
I want to create one population file with the date and number of stocks with active data which should have a head and tail shown as follows:
date,num_stocks
1962-01-02,11
1962-11-19,12
1970-01-02,30
1971-08-06,31
1972-06-01,54
...
2016-12-08,2642
2016-12-15,2641
2016-12-16,2640
2016-12-21,2639
2016-12-27,2639
The ultimate goal is to be able to plot the number of stocks with data over any specified date range by reading the population file.
To align the dates with their respective counts, I'd take the difference of pd.Series.value_counts:
df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
1984-09-07 1.0
1990-03-26 1.0
1999-11-18 1.0
2005-05-06 1.0
2005-09-27 1.0
2011-12-16 1.0
2012-12-13 1.0
2013-02-01 1.0
2014-04-11 1.0
2015-09-09 -1.0
2016-11-01 1.0
2016-12-27 -9.0
dtype: float64
Thanks to the crucial tip provided by piRSquared I solved the challenge using this code:
import pandas as pd
df = pd.read_csv('meta.csv')
x = df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
x.iloc[-1] = 0
r = x.cumsum()
r.to_csv('pop.csv')
z = pd.read_csv('pop.csv', index_col=0, header=None)
z.plot(title='NUMBER OF STOCKS WITH DATA', legend=None,
       grid=True, style='k.')
'pop.csv' file head/tail:
1962-01-02 11.0
1962-11-19 12.0
1970-01-02 30.0
1971-08-06 31.0
1972-06-01 54.0
...
2016-12-08 2642.0
2016-12-15 2641.0
2016-12-16 2640.0
2016-12-21 2639.0
2016-12-27 2639.0
Chart:
